The setup of the tile part imitated the assignment I did for STA303.
Codes are adapted from BCB420 lecture slides.
The journal page for A3 is: https://github.com/bcb420-2022/Xinyi_Xu/wiki/12.-Data-set-Pathway-and-Network-Analysis
We obtained the raw data from article: Transcriptomic Similarities and Differences in Host Response between SARS-CoV-2 and Other Viral Infection, with GEO(Barrett et al. 2012) id GSE152641(Thair et al. 2021). The data contains 20460 observations with 87 variables, there are in total 86 participants, among them, 62 are COVID-19 patients and 24 are healthy individuals.
We created our normalized data set by removing duplicated and unwanted observations. Box plots and density plots were used to visualize the data and then we perform TMM normalization on our primary cleaned data, after doing so, we noticed that there is a significant difference between normalized and un-normalized data. The entrez_id used by raw data set was transferred into the hgnc_symbol or corresponding gene name for the normalized data set and duplicates were removed again. Our normalized data set has 14413 observations.
Based on the normalized data, we used package limma(Ritchie et al. 2015) and edgeR(Robinson, McCarthy, and Smyth 2010) (McCarthy, Chen, and Smyth 2012) to propose models and calculate the corresponding p-values. Volcano plots and heat maps are graphed for each model. We used a threshold 0.05 to identify genes that are significant, and filter out the unwanted gene names. Three txt files containing upregulate, downregulate, and ranked genes respectively were created for further analysis.
Note the setup section which load packages are hided.
# We load the ranked data we had created in Assignment 2 and skim it.
ranked_data <- read.table(file="covid_rankedgenelist.rnk",
header = TRUE,sep = "\t",
stringsAsFactors = FALSE,
check.names=FALSE)
kable(ranked_data[1:5,1:2], type="html", caption = "Table1. Skim on the ranked data")
| Genename | rank |
|---|---|
| H2BC6 | -16.14800 |
| POLQ | -16.14481 |
| APOBEC3A | -15.45742 |
| APOBEC3A_B | -15.45742 |
| CC2D2A | -14.80352 |
The data set that we use for GSEA is the ranked data that we have created in Assignment 2, this data set includes all the ranked values and no threshold was applied, thus it has 14413 observations and 2 variables, the number of observations is the same as our normalized data set we have used in Assignment 2, detailed codes for creation of ranked data is in Thresholded over-representation analysis section of Assignment 2 and was slightly modified and documented again on journal page for A3.
We used GSEA pre-ranked analysis. The specific values and versions we used are:
Version: 4.2.3
Enrichment statistic: weighted
Max size: 200
Min size: 15
Collapse: No_Collapse
The genesets we used was adapted from the baderlab geneset(Merico et al. 2010a) collection from March 1, 2021 with GOBP all pathways and no GO IEA. (Codes are documented in journal GSEA)
Note the idea of adding tables come from: https://bookdown.org/yihui/rmarkdown-cookbook/kable.html
Table2. Layout of the Enrichment result
| Stats | na_pos | na_neg |
|---|---|---|
| Upregulated | 2205 | 3088 |
| FDR < 25% | 323 | 1092 |
| significantly enriched at nominal pvalue < 1% | 189 | 558 |
| significantly enriched at nominal pvalue < 5% | 376 | 910 |
| Top Gene sets | EUKARYOTIC TRANSLATION INITIATION%REACTOME DATABASE ID RELEASE 79%72613 | HALLMARK_INTERFERON_ALPHA_RESPONSE%MSIGDB_C2%HALLMARK_INTERFERON_ALPHA_RESPONSE |
In total 5293 gene sets are used. na_pos gene sets are genes that are upregulated in healthy human individuals, and na_neg gene sets are genes that are downregulated in healthy human individuals.
Table3. Top Genes Sets for na_pos and na_neg
| Stats | EUKARYOTIC TRANSLATION INITIATION%REACTOME DATABASE ID RELEASE 79%72613 | HALLMARK_INTERFERON_ALPHA_RESPONSE%MSIGDB_C2%HALLMARK_INTERFERON_ALPHA_RESPONSE |
|---|---|---|
| Upregulated in class | na_pos | na_neg |
| Enrichment Score (ES) | 0.6688145 | -0.8105112 |
| Normalized Enrichment Score (NES) | 2.506432 | -2.9652746 |
| Nominal p-value | 0.0 | 0.0 |
| FDR q-value | 0.0 | 0.0 |
| FWER p-Value | 0.0 | 0.0 |
| Leading-edge | 81 | 58 |
| Top Gene | EIF4B | EPSTI1 |
It is not straight forward comparison, because the number of genes sets involved in GSEA and thresholded analysis is different, the total gene sets involved in thresholded analysis is 7802, while after filtering, the total GSEA gene sets used are 5293.
But the results we get from thresholded analysis and GSEA are very similar. Both of the top two gene sets we get from GSEA are among the top gene sets we get from REACTOME in thresholded analysis. And by observing the top gene sets of both thresholded analysis and GSEA, they do overlap with each other, like HALLMARK_INTERFERON_GAMMA_RESPONSE%MSIGDB_C2%HALLMARK_INTERFERON_GAMMA_RESPONSE, and GTP HYDROLYSIS AND JOINING OF THE 60S RIBOSOMAL SUBUNIT%REACTOME%R-HSA-72706.2 which are the second top gene sets given by GSEA for na_neg and na_pos, are also among the top gene sets of thresholded analysis, similar examples would be: DEFENSE RESPONSE TO BACTERIUM%GOBP%GO:0042742, CYTOPLASMIC TRANSLATION%GOBP%GO:0002181 and so on.
We use Cytoscape(Shannon et al. 2003) version 3.9.1 for MacOS. Apps like EnrichmentMap(Merico et al. 2010b), and AutoAnnotate(Kucera et al. 2016) are installed from Cytoscape app store.
There are in total 534 nodes and 2771 edges. The thresholds used is listed in the figure below: The FDR q-value cutoff was set to 0.05, and all other values remained default.
Fig1. Thresholds used to create the Enrichment Map
Fig2. Network prior to manual layout
Fig3. Legend
The plugin package used to annotate the network is: AutoAnnotate(Kucera et al. 2016). We used the default choice plus the Layout network to prevent cluster overlap.
Fig4. Annotated Network without prevent cluster overlap
Fig5. Annotated Network without prevent cluster overlap
Fig6. Annotated Network with prevent cluster overlap
Fig7. Legend
We use the publication ready button in the Enrichment map section.
Fig8. Publication ready figure
Fig9. The Collapsed Network with no overlap
Fig10. The Collapsed Network with overlap
There are in total 173 clusters and 41 edges in both Networks. The major themes present in this analysis are: cell cycle, translation and transcription, immune response, viral life cycle, cancer, and DNA repair.
We believe that the result we get matches with the model. As EUKARYOTIC TRANSLATION INITIATION%REACTOME DATABASE ID RELEASE 79%72613 is clearly gene set that is involved in translation and transcription, while HALLMARK_INTERFERON_ALPHA_RESPONSE%MSIGDB_C2%HALLMARK_INTERFERON_ALPHA_RESPONSE is a gene set that involved in immune system, and interferons are proteins which indicate that there are invaders or cancer cells in the body, and this matches with our purpose of investigation as we study the difference between healthy and Covid19 individuals. This result of GSEA is matched with our theme network, there is no clear novel pathways or themes.
The enrichment results support the article: Transcriptomic Similarities and Differences in Host Response between SARS-CoV-2 and Other Viral Infection(Thair et al. 2021), as the article focused on covid 19 and other viral infections which matches with our theme: viral life cycle and immune response. Immune response and cancer was also discussed in the article where the authors claim that neutrophil-to-lymphocyte ratios are used to mark cancer severity(Diao et al. 2020) (Liu et al. 2020) (Lagunas-Rangel 2020) (Qin et al. 2020). The article also stated that translation pathway like translation of ferritin is affected with COVID19 infections. Moreover, the article mentioned that genes involved in neutrophil activation, innate immune response, immune response to viral infection, type-I interferon signaling, cytokine production and lymphocyte differentiation and T-cell activation and regulation(Thair et al. 2021) differ between healthy and Covid19 individuals, which highly matches with our enrichment results.
Novel pathways is DNA repair (notch expression ercc6) which was not discussed in the article.
The overall results matches with our Assignment #2 thresholded analysis results, as in our analysis in Assignment 2, we found most genes involved in the pathways mentioned above are among the top ranked genes in g:Profiler(Reimand et al. 2007) analysis. The results are consistent and do not show significant difference. The slight difference may due to the fact that the total gene sets involved in thresholded analysis is 7802, while after filtering, the total GSEA gene sets used are 5293 as we have discussed in the previous section.
Evidence and discussions that we did in Assignment #2 also supports our enrichment results. Besides that, several articles(Diao et al. 2020) (Liu et al. 2020) (Lagunas-Rangel 2020) (Qin et al. 2020) mentioned in the previous section, provided an evidence to supporting our result: immune response and cancer are among the major themes present in this analysis.
The article(Pánico, Ostrosky-Wegman, and Salazar 2022) claims that viral infections could lead to DNA damage and might also increase the risk to cancer development, which supports our finding that there are major themes: DNA repair and cancer involved in the major themes, though DNA repair is not mentioned in our original paper.